Abstract

Recent advances in Bayesian reinforcement learn- 
ing (BRL) have shown that Bayes-optimality is 
theoretically achievable by modeling the envi-
ronment's latent dynamics using the Flat-Dirichlet-
Multinomial (FDM) prior. In self-interested multi-
agent environments, the transition dynamics are
mainly controlled by the other agent's stochastic 
behavior for which FDM's independence and mod- 
eling assumptions do not hold. As a result, FDM 
does not allow the other agent's behavior to be 
generalized across different states nor specified us- 
ing prior domain knowledge. To overcome these 
practical limitations of FDM, we propose a gener- 
alization of BRL to integrate the general class of 
parametric models and model priors, thus allowing 
practitioners' domain knowledge to be exploited to 
produce a fine-grained and compact representation 
of the other agent's behavior. Empirical evalua- 
tion shows that our approach outperforms existing 
multi-agent reinforcement learning algorithms. 



1 Introduction 

In reinforcement learning (RL), an agent faces a dilemma 
between acting optimally with respect to the current, possi- 
bly incomplete knowledge of the environment (i.e., exploita- 
tion) vs. acting sub-optimally to gain more information about 
it (i.e., exploration). Model-based Bayesian reinforcement 
learning (BRL) circumvents such a dilemma by consider-
ing the notion of Bayes-optimality [Duff, 2003]: A Bayes-
optimal policy selects actions that maximize the agent's ex-
pected utility with respect to all possible sequences of future
beliefs (starting from the initial belief) over candidate mod- 
els of the environment. Unfortunately, due to the large belief 
space, the Bayes-optimal policy can only be approximately 
derived under a simple choice of models and model priors. 
For example, the Flat-Dirichlet-Multinomial (FDM) prior
[Poupart et al., 2006] assumes the next-state distributions for
each action-state pair to be modeled as independent multino- 
mial distributions with separate Dirichlet priors. Despite its 
common use to analyze and benchmark algorithms, FDM can 
perform poorly in practice as it often fails to exploit the
structured information of a problem [Asmuth and Littman, 2011;
Araya-Lopez et al., 2012].



To elaborate, a critical limitation of FDM lies in its inde- 
pendence assumption driven by computational convenience 
rather than scientific insight. We can identify practical exam- 
ples in the context of self-interested multi-agent RL (MARL) 
where the uncertainty in the transition model is mainly caused 
by the stochasticity in the other agent's behavior (in different 
states) for which the independence assumption does not hold 
(e.g., motion behavior of pedestrians [Natarajan et al., 2012a; 
2012b]). Consider, for example, an application of BRL in the 
problem of placing static sensors to monitor an environmental 
phenomenon: It involves actively selecting sensor locations 
(i.e., states) for measurement such that the sum of predictive 
variances at the unobserved locations is minimized. Here, the 
phenomenon is the "other agent" and the measurements are 
its actions. An important characterization of the phenomenon 
is that of the spatial correlation of measurements between 
neighboring locations/states [Low et al., 2007; 2008; 2009;
2011; 2012; Chen et al., 2012; Cao et al., 2013], which makes
FDM-based BRL extremely ill-suited for this problem due to 
its independence assumption. 

Secondly, despite its computational convenience, FDM
does not permit generalization across states [Asmuth and
Littman, 2011], thus severely limiting its applicability in
practical problems with a large state space where past obser-
vations only come from a very limited set of states. Interest- 
ingly, in such problems, it is often possible to obtain prior do- 
main knowledge providing a more "parsimonious" structure 
of the other agent's behavior, which can potentially resolve 
the issue of generalization. For example, consider using BRL 
to derive a Bayes-optimal policy for an autonomous car to 
navigate successfully among human-driven vehicles [Hoang 
and Low, 2012; 2013b] whose behaviors in different situa- 
tions (i.e., states) are governed by a small, consistent set of 
latent parameters, as demonstrated in the empirical study of 
Gipps [1981]. By estimating/learning these parameters, it is
then possible to generalize their behaviors across different 
states. This, however, contradicts the independence assump- 
tion of FDM; in practice, ignoring this results in an inferior 
performance, as shown in Section 4. Note that, by using pa-
rameter tying [Poupart et al., 2006], FDM can be modified to
make the other agent's behavior identical in different states. 
But, this simple generalization is too restrictive for real-world
problems like the examples above where the other agent's be-
havior in different states is not necessarily identical but re-
lated via a common set of latent "non-Dirichlet" parameters.

Consequently, there is still a huge gap in putting BRL
into practice for interacting with self-interested agents of un-
known behaviors. To the best of our knowledge, this was first
investigated by Chalkiadakis and Boutilier [2003] who offer
a myopic solution in the belief space instead of solving for
a Bayes-optimal policy that is non-myopic. Their proposed
BPVI method essentially selects actions that jointly maxi-
mize a heuristic aggregation of myopic value of perfect infor-
mation [Dearden et al., 1998] and an average estimation of
expected utility obtained from solving the exact MDPs with 
respect to samples drawn from the posterior belief of the other 
agent's behavior. Moreover, BPVI is restricted to work only 
with Dirichlet priors and multinomial likelihoods (i.e., FDM), 
which are subject to the above disadvantages in modeling the 
other agent's behavior. Also, BPVI is demonstrated empiri- 
cally in the simplest of settings with only a few states. 

Furthermore, in light of the above examples, the other 
agent's behavior often needs to be modeled differently de- 
pending on the specific application. Grounding in the context 
of the BRL framework, either the domain expert struggles 
to best fit his prior knowledge to the supported set of mod- 
els and model priors or the agent developer has to re-design 
the framework to incorporate a new modeling scheme. Ar- 
guably, there is no free lunch when it comes to modeling the 
other agent's behavior across various applications. To cope 
with this difficulty, the BRL framework should ideally allow 
a domain expert to freely incorporate his choice of design in 
modeling the other agent's behavior. 

Motivated by the above practical considerations, this paper
presents a novel generalization of BRL, which we call
Interactive BRL (I-BRL) (Section 3), to integrate any parametric
model and model prior of the other agent's behavior (Section 2)
specified by domain experts, consequently yielding two
advantages: The other agent's behavior can be represented (a) in a
fine-grained manner based on the practitioners' prior domain
knowledge, and (b) compactly to be generalized across different
states, thus overcoming the limitations of FDM. We show how the
non-myopic Bayes-optimal policy can be derived analytically by
solving I-BRL exactly (Section 3.1) and propose an approximation
algorithm to compute it efficiently in polynomial time
(Section 3.2). Empirically, we evaluate the performance of I-BRL
against that of BPVI [Chalkiadakis and Boutilier, 2003] using an
interesting traffic problem modeled after a real-world situation
(Section 4).

2 Modeling the Other Agent 

In our proposed Bayesian modeling paradigm, the opponent's¹
behavior is modeled as a set of probabilities
$p^v_{sh}(\lambda) \triangleq \Pr(v \mid s, h, \lambda)$ for selecting
action $v$ in state $s$ conditioned on the history
$h \triangleq \{s_i, u_i, v_i\}_{i=1}^{d}$ of $d$ latest interactions
where $u_i$ is the action taken by our agent in the $i$-th step. These
distributions are parameterized by $\lambda$, which abstracts the
actual parametric form of the opponent's behavior; this abstraction
gives practitioners the flexibility to choose the most suitable
degree of parameterization. For example, $\lambda$ can simply be a set
of multinomial distributions $\lambda \triangleq \{\theta^v_{sh}\}$ such
that $p^v_{sh}(\lambda) \triangleq \theta^v_{sh}$ if no prior domain
knowledge is available. Otherwise, the domain knowledge can be
exploited to produce a fine-grained representation of $\lambda$; at the
same time, $\lambda$ can be made compact to generalize the opponent's
behavior across different states (e.g., Section 4).

The opponent's behavior can be learned by monitoring the
belief $b(\lambda) \triangleq \Pr(\lambda)$ over all possible $\lambda$.
In particular, the belief (or probability density) $b(\lambda)$ is
updated at each step based on the history
$h \circ \langle s, u, v \rangle$ of $d + 1$ latest interactions (with
$\langle s, u, v \rangle$ being the most recent one) using Bayes'
theorem:

$$b^v_{sh}(\lambda) \propto p^v_{sh}(\lambda)\, b(\lambda). \qquad (1)$$

Let $\bar{s} = (s, h)$ denote an information state that consists of the
current state and the history of $d$ latest interactions. When
the opponent's behavior is stationary (i.e., $d = 0$), it follows
that $\bar{s} = s$. For ease of notation, the main results of our work
(in subsequent sections) are presented only for the case where
$d = 0$ (i.e., $\bar{s} = s$); extension to the general case just
requires replacing $s$ with $\bar{s}$. In this case, (1) can be
rewritten as

$$b^v_s(\lambda) \propto p^v_s(\lambda)\, b(\lambda). \qquad (2)$$

The key difference between our Bayesian modeling paradigm
and FDM [Poupart et al., 2006] is that we do not require
$b(\lambda)$ and $p^v_s(\lambda)$ to be, respectively, a Dirichlet prior
and a multinomial likelihood, where the Dirichlet is a conjugate prior
for the multinomial. In practice, such a conjugate prior is desirable
because the posterior $b^v_s$ belongs to the same Dirichlet family
as the prior $b$, thus making the belief update tractable and
the Bayes-optimal policy efficient to derive. Despite its
computational convenience, this conjugate prior restricts
practitioners from exploiting their domain knowledge to design more
informed priors (e.g., see Section 4). Furthermore, it turns out to be
overkill if the aim is just to make the belief update tractable. In
particular, we show in Theorem 1 below that, without assuming any
specific parametric form of the initial prior, the posterior belief
can still be tractably represented even though the prior and the
likelihood are not necessarily conjugate distributions. This is indeed
sufficient to guarantee and derive a tractable representation of the
Bayes-optimal policy using a finite set of parameters, as shall be
seen later in Section 3.1.



1 For convenience, we will use the terms "other agent" and
"opponent" interchangeably from now on.



Theorem 1 If the initial prior $b$ can be represented exactly
using a finite set of parameters, then the posterior $b'$ conditioned
on a sequence of observations $\{(s_i, v_i)\}_{i=1}^{n}$ can also
be represented exactly in parametric form.
Proof Sketch. From (2), we can prove by induction on $n$ that

$$b'(\lambda) \propto \Phi(\lambda)\, b(\lambda) \qquad (3)$$

$$\Phi(\lambda) \triangleq \prod_{s \in S} \prod_{v \in V} p^v_s(\lambda)^{\psi^v_s} \qquad (4)$$

where $\psi^v_s \triangleq \sum_{i=1}^{n} \delta_{sv}(s_i, v_i)$ and
$\delta_{sv}$ is the Kronecker delta function that returns 1 if
$s = s_i$ and $v = v_i$, and 0 otherwise.²

2 Intuitively, $\Phi(\lambda)$ can be interpreted as the likelihood of
observing each pair $(s, v)$ for $\psi^v_s$ times while interacting
with an opponent whose behavior is parameterized by $\lambda$.

From (3), it is clear that $b'$ can be represented by a set of
parameters $\{\psi^v_s\}_{s,v}$ and the finite representation of $b$.
Thus, belief update is performed simply by incrementing the
hyperparameter $\psi^v_s$ according to each observation $(s, v)$. □
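As a minimal sketch of the representation in Theorem 1 (not the authors' implementation), the snippet below keeps the posterior only through the observation counts and a fixed set of parameter values standing in for the prior. The class name `CountBelief`, the toy likelihood `toy_likelihood`, and the uniform grid over a scalar parameter are all assumptions made for illustration.

```python
import math


class CountBelief:
    """Posterior over lambda kept as observation counts psi plus a fixed prior."""

    def __init__(self, prior_points, likelihood):
        self.prior_points = prior_points  # lambda values representing the prior b
        self.likelihood = likelihood      # likelihood(v, s, lam) = p(v | s, lam)
        self.psi = {}                     # psi[(s, v)] = times (s, v) was observed

    def update(self, s, v):
        # Belief update: increment a single hyperparameter.
        self.psi[(s, v)] = self.psi.get((s, v), 0) + 1

    def phi(self, lam):
        # Phi(lam) = product over (s, v) of p(v | s, lam) ** psi[(s, v)]
        log_phi = sum(n * math.log(self.likelihood(v, s, lam))
                      for (s, v), n in self.psi.items())
        return math.exp(log_phi)

    def posterior_mean(self):
        # Mean of lambda under b'(lam), which is proportional to Phi(lam) * b(lam).
        weights = [self.phi(lam) for lam in self.prior_points]
        total = sum(w * lam for w, lam in zip(weights, self.prior_points))
        return total / sum(weights)


def toy_likelihood(v, s, lam):
    # Hypothetical non-multinomial model: one scalar lam drives both states.
    p_one = lam ** (s + 1)
    return p_one if v == 1 else 1.0 - p_one


grid = [i / 1000 for i in range(1, 1000)]  # assumed uniform prior on (0, 1)
belief = CountBelief(grid, toy_likelihood)
belief.update(0, 1)  # opponent played v = 1 in state 0
belief.update(1, 1)  # opponent played v = 1 in state 1
mean = belief.posterior_mean()
```

In this toy setting the two observations give a posterior proportional to the cube of the parameter, so `mean` comes out close to 0.8, and an observation in one state sharpens predictions in the other because both share the same parameter.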

3 Interactive Bayesian RL (I-BRL) 

In this section, we first extend the proof techniques used
in [Poupart et al., 2006] to theoretically derive the agent's
Bayes-optimal policy against the general class of parametric
models and model priors of the opponent's behavior (Section 2).
In particular, we show that the derived Bayes-optimal
policy can also be represented exactly using a finite number of
parameters. Based on our derivation, a naive algorithm can be
devised to compute the exact parametric form of the Bayes-optimal
policy (Section 3.1). Finally, we present a practical
algorithm to efficiently approximate this Bayes-optimal policy
in polynomial time (with respect to the size of the environment
model) (Section 3.2).

Formally, an agent is assumed to be interacting with its
opponent in a stochastic environment modeled as a tuple
$(S, U, V, \{r_s\}, \{p^{uv}_s\}, \{p^v_s(\lambda)\}, \phi)$ where $S$ is
a finite set of states, and $U$ and $V$ are sets of actions available
to the agent and its opponent, respectively. In each stage, the
immediate payoff $r_s(u, v)$ to our agent depends on the joint action
$(u, v) \in U \times V$ and the current state $s \in S$. The
environment then transitions to a new state $s'$ with probability
$p^{uv}_s(s') \triangleq \Pr(s' \mid s, u, v)$ and the future payoff
(in state $s'$) is discounted by a constant factor $0 < \phi < 1$, and
so on. Finally, as described in Section 2, the opponent's latent
behavior $\{p^v_s(\lambda)\}$ can be selected from the general class of
parametric models and model priors, which subsumes FDM (i.e.,
independent multinomials with separate Dirichlet priors).

Now, let us recall that the key idea underlying the notion of
Bayes-optimality [Duff, 2003] is to maintain a belief $b(\lambda)$ that
represents the uncertainty surrounding the opponent's behavior
$\lambda$ in each stage of interaction. Thus, the action selected
by the learner in each stage affects both its expected immediate
payoff $\mathbb{E}[\sum_v p^v_s(\lambda)\, r_s(u, v) \mid b]$ and the
posterior belief state $b^v_s(\lambda)$, the latter of which influences
its future payoff and builds in the information gathering option
(i.e., exploration). As such, the Bayes-optimal policy can be obtained
by maximizing the expected discounted sum of rewards $V_s(b)$:

$$V_s(b) \triangleq \max_u \sum_v \langle p^v_s, b \rangle \left( r_s(u, v) + \phi \sum_{s'} p^{uv}_s(s')\, V_{s'}(b^v_s) \right) \qquad (5)$$

where $\langle a, b \rangle \triangleq \int_\lambda a(\lambda)\, b(\lambda)\, d\lambda$.
The optimal policy for the learner is then defined as a function
$\pi^*$ that maps the belief $b$ to an action $u$ maximizing its
expected utility, which can be derived by solving (5). To derive our
solution, we first re-state two well-known results concerning the
augmented belief-state MDP in single-agent RL [Poupart et al., 2006],
which also hold straightforwardly for our general class of
parametric models and model priors.

Theorem 2 The optimal value function $V^k$ for $k$ steps-to-go
converges to the optimal value function $V$ for infinite horizon
as $k \to \infty$: $\|V - V^{k+1}\|_\infty \le \phi\, \|V - V^k\|_\infty$.

Theorem 3 The optimal value function $V^k_s(b)$ for $k$ steps-to-go
can be represented by a finite set $\Gamma^k_s$ of $\alpha$-functions:

$$V^k_s(b) = \max_{\alpha_s \in \Gamma^k_s} \langle \alpha_s, b \rangle. \qquad (6)$$



Simply put, these results imply that the optimal value $V_s$ in
(5) can be approximated arbitrarily closely by a finite set
$\Gamma^k_s$ of piecewise linear $\alpha$-functions $\alpha_s$, as shown
in (6). Each $\alpha$-function $\alpha_s$ is associated with an action
$u_{\alpha_s}$ yielding an expected utility of $\alpha_s(\lambda)$ if
the true behavior of the opponent is $\lambda$ and consequently an
overall expected reward $\langle \alpha_s, b \rangle$ by assuming that,
starting from $(s, b)$, the learner selects action $u_{\alpha_s}$
and continues optimally thereafter. In particular, $\Gamma^k_s$ and
$u_{\alpha_s}$ can be derived based on a constructive proof of
Theorem 3. However, due to limited space, we only state the
constructive process below. Interested readers are referred to
Appendix A for a detailed proof. Specifically, given
$\{\Gamma^k_s\}_s$ such that (6) holds for $k$, it follows (see
Appendix A) that

$$V^{k+1}_s(b) = \max_{u, t} \langle \alpha^{ut}_s, b \rangle \qquad (7)$$

where $t \triangleq (t_{s'v})_{s' \in S, v \in V}$ with
$t_{s'v} \in \{1, \ldots, |\Gamma^k_{s'}|\}$, and

$$\alpha^{ut}_s(\lambda) \triangleq \sum_v p^v_s(\lambda) \left( r_s(u, v) + \phi \sum_{s'} \alpha^{t_{s'v}}_{s'}(\lambda)\, p^{uv}_s(s') \right) \qquad (8)$$

such that $\alpha^{t_{s'v}}_{s'}$ denotes the $t_{s'v}$-th
$\alpha$-function in $\Gamma^k_{s'}$. Setting
$\Gamma^{k+1}_s = \{\alpha^{ut}_s\}_{u,t}$ and $u_{\alpha^{ut}_s} = u$,
it follows that (6) also holds for $k + 1$. As a result, the optimal
policy $\pi^*(b)$ can be derived directly from these $\alpha$-functions
by $\pi^*(b) \triangleq u_{\alpha^*_s}$ where
$\alpha^*_s = \arg\max_{\alpha^{ut}_s \in \Gamma^{k+1}_s} \langle \alpha^{ut}_s, b \rangle$.
Thus, constructing $\Gamma^{k+1}_s$ from the previously constructed
sets $\{\Gamma^k_{s'}\}_{s'}$ essentially boils down to an exhaustive
enumeration of all possible pairs $(u, t)$ and the corresponding
application of (8) to compute $\alpha^{ut}_s$. Though (8) specifies a
bottom-up procedure constructing $\Gamma^{k+1}_s$ from the previously
constructed sets $\{\Gamma^k_{s'}\}_{s'}$ of $\alpha$-functions, it
implicitly requires a convenient parameterization for the
$\alpha$-functions that is closed under the application of (8).
To complete this analytical derivation, we present a final result
to demonstrate that each $\alpha$-function is indeed of such
parametric form. Note that Theorem 4 below generalizes a
similar result proven in [Poupart et al., 2006], the latter of
which shows that, under FDM, each $\alpha$-function can be represented
by a linear combination of multivariate monomials.
A practical algorithm building on our generalized result in
Theorem 4 is presented in Section 3.2.
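The exhaustive enumeration of pairs (u, t) described above can be sketched as follows. This is only an illustration under assumed names: `enumerate_pairs` and its arguments are hypothetical, and indices are 0-based rather than 1-based.

```python
from itertools import product


def enumerate_pairs(agent_actions, states, opp_actions, gamma_sizes):
    """Yield every (u, t), where t picks one alpha-function index per (s', v).

    gamma_sizes[s2] is the number of alpha-functions stored for state s2
    at the previous step (a hypothetical input for this sketch).
    """
    slots = [(s2, v) for s2 in states for v in opp_actions]
    ranges = [range(gamma_sizes[s2]) for (s2, _v) in slots]
    for u in agent_actions:
        for choice in product(*ranges):
            yield u, dict(zip(slots, choice))


pairs = list(enumerate_pairs(["slow", "fast"], [0, 1], [0, 1], {0: 2, 1: 3}))
```

With two agent actions, two opponent actions, and previous sets of sizes 2 and 3, the generator produces 2 x (2 x 3)^2 = 72 pairs, which already hints at how quickly the exact construction grows.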

Theorem 4 Let $\mathbf{\Phi}$ denote a family of all functions
$\Phi(\lambda)$ (4). Then, the optimal value $V^k_s$ can be represented
by a finite set $\Gamma^k_s$ of $\alpha$-functions $\alpha^j_s$ for
$j = 1, \ldots, |\Gamma^k_s|$:

$$\alpha^j_s(\lambda) = \sum_{i=1}^{m} c_i\, \Phi_i(\lambda) \qquad (9)$$

where $\Phi_i \in \mathbf{\Phi}$. So, each $\alpha$-function
$\alpha^j_s$ can be compactly represented by a finite set of parameters
$\{c_i\}_{i=1}^{m}$.³

3 To ease readability, we abuse the notations $\{c_i, \Phi_i\}_{i=1}^{m}$
slightly: Each $\alpha^j_s(\lambda)$ should be specified by a different
set $\{c_i, \Phi_i\}_{i=1}^{m}$.



Proof Sketch. We will prove (9) by induction on $k$.⁴ Suppose
(9) holds for $k$. Setting $j = t_{s'v}$ in (9) results in

$$\alpha^{t_{s'v}}_{s'}(\lambda) = \sum_{i=1}^{m} c_i\, \Phi_i(\lambda), \qquad (10)$$

which is then plugged into (8) to yield

$$\alpha^{ut}_s(\lambda) = \sum_{v \in V} c_v\, \Psi_v(\lambda) + \sum_{s' \in S} \sum_{v \in V} \left( \sum_{i=1}^{m} c^{vi}_{s'}\, \Psi^{vi}_{s'}(\lambda) \right) \qquad (11)$$

where $\Psi_v(\lambda) = p^v_s(\lambda)$,
$\Psi^{vi}_{s'}(\lambda) = p^v_s(\lambda)\, \Phi_i(\lambda)$, and the
coefficients $c_v = r_s(u, v)$ and
$c^{vi}_{s'} = \phi\, p^{uv}_s(s')\, c_i$. It is easy to see
that $\Psi_v \in \mathbf{\Phi}$ and $\Psi^{vi}_{s'} \in \mathbf{\Phi}$.
So, (9) clearly holds for $k + 1$. We have shown above that, under the
general class of parametric models and model priors (Section 2), each
$\alpha$-function can be represented by a linear combination of
arbitrary parametric functions in $\mathbf{\Phi}$, which subsume the
multivariate monomials used in [Poupart et al., 2006]. □
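The closure argument in this proof can be illustrated with a small symbolic sketch. Each function in the family is identified by its exponent counts (one count per state-action pair), so multiplying by an opponent-action probability only increments one count. The data layout and the names `bump` and `backup_alpha` are assumptions for illustration, not the authors' code.

```python
from collections import Counter


def bump(psi, s, v):
    """Multiply a family member (given by exponent counts psi) by p_s^v."""
    counts = Counter(dict(psi))
    counts[(s, v)] += 1
    return tuple(sorted(counts.items()))


def backup_alpha(s, u, chosen, reward, trans, discount, states, opp_actions):
    """One application of the backup for a fixed pair (u, t).

    chosen[(s2, v)] is the alpha-function selected by t for (s2, v),
    stored as a dict mapping exponent counts to coefficients.
    """
    out = Counter()
    for v in opp_actions:
        # First summation: coefficient r_s(u, v) on the term p_s^v.
        out[bump((), s, v)] += reward[(s, u, v)]
        for s2 in states:
            for psi, c in chosen[(s2, v)].items():
                # Second summation: each old term is multiplied by p_s^v.
                out[bump(psi, s, v)] += discount * trans[(s, u, v)][s2] * c
    return dict(out)


states, opp_actions = [0], [0, 1]
reward = {(0, "u", 0): 1.0, (0, "u", 1): 2.0}
trans = {(0, "u", 0): {0: 1.0}, (0, "u", 1): {0: 1.0}}
zero = {(0, 0): {}, (0, 1): {}}  # base case: the zero function
a1 = backup_alpha(0, "u", zero, reward, trans, 0.5, states, opp_actions)
a2 = backup_alpha(0, "u", {(0, 0): a1, (0, 1): a1}, reward, trans, 0.5,
                  states, opp_actions)
```

After one backup the result has two terms; after two it has five, each still a coefficient attached to a product of opponent-action probabilities, which is the closure property the theorem relies on.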

3.1 An Exact Algorithm 

Intuitively, Theorems 3 and 4 provide a simple and constructive
method for computing the set of $\alpha$-functions and hence,
the optimal policy. In step $k + 1$, the sets $\Gamma^{k+1}_s$ for all
$s \in S$ are constructed using (10) and (11) from $\Gamma^k_{s'}$ for
all $s' \in S$, the latter of which are computed previously in step
$k$. When $k = 0$ (i.e., base case), see the proof of Theorem 4 above
(i.e., footnote 4). A sketch of this algorithm is shown below:

BACKUP($s$, $k + 1$)

1. $\Gamma^{*}_{s,u} \leftarrow \left\{ g(\lambda) \triangleq \sum_v c_v\, \Psi_v(\lambda) \right\}$

2. $\Gamma^{v,s'}_{s,u} \leftarrow \left\{ g_j(\lambda) \triangleq \sum_{i=1}^{m} c^{vi}_{s'}\, \Psi^{vi}_{s'}(\lambda) \right\}_{j = 1, \ldots, |\Gamma^k_{s'}|}$

3. $\Gamma_{s,u} \leftarrow \Gamma^{*}_{s,u} \oplus \left( \bigoplus_{s', v} \Gamma^{v,s'}_{s,u} \right)$⁵

4. $\Gamma^{k+1}_s \leftarrow \bigcup_{u \in U} \Gamma_{s,u}$

In the above algorithm, steps 1 and 2 compute the first and
second summation terms on the right-hand side of (11),
respectively. Then, steps 3 and 4 construct
$\Gamma^{k+1}_s = \{\alpha^{ut}_s\}_{u,t}$ using (11) over all $t$ and
$u$, respectively. Thus, by iteratively computing
$\Gamma^{k+1}_s = \text{BACKUP}(s, k + 1)$ for a sufficiently
large value of $k$, $\Gamma^{k+1}_s$ can be used to approximate $V_s$
arbitrarily closely, as shown in Theorem 2. However, this naive
algorithm is computationally impractical due to the following issues:
(a) $\alpha$-function explosion: the number of $\alpha$-functions
grows doubly exponentially in the planning horizon length, as
derived from (7) and (8):
$|\Gamma^{k+1}_s| = O\left( \left[ \prod_{s'} |\Gamma^k_{s'}| \right]^{|V|} |U| \right)$,
and (b) parameter explosion: the average number of parameters used to
represent an $\alpha$-function grows by a factor of $O(|S||V|)$, as
manifested in (11). The practicality of our approach therefore depends
crucially on how these issues are resolved, as described next.
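To make the doubly exponential growth concrete, the following sketch iterates the size bound for a toy problem. The function name `next_gamma_size` and the toy dimensions are assumptions; the computation simply counts the number of (u, t) pairs.

```python
def next_gamma_size(prev_sizes, num_agent_actions, num_opp_actions):
    """Number of (u, t) pairs: |U| * (product over s' of |Gamma_s'|) ** |V|."""
    prod = 1
    for size in prev_sizes:
        prod *= size
    return num_agent_actions * prod ** num_opp_actions


num_states, num_agent_actions, num_opp_actions = 2, 2, 2
sizes = [1]  # one alpha-function per state at step 0
for _ in range(3):
    prev = [sizes[-1]] * num_states  # every state has the same count here
    sizes.append(next_gamma_size(prev, num_agent_actions, num_opp_actions))
```

Even with two states and two actions per agent, `sizes` grows as 1, 2, 32, 2097152 over three backups, which is why the exact algorithm is only of theoretical interest.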



4 When $k = 0$, (9) can be verified by letting $c_i = 0$.
5 $A \oplus B = \{a + b \mid a \in A, b \in B\}$.



3.2 A Practical Approximation Algorithm 

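In this section, we introduce practical modifications of the
BACKUP algorithm by addressing the above-mentioned issues.
We first address the issue of $\alpha$-function explosion by
generalizing discrete POMDP's PBVI solver [Pineau et al.,
2003] to be used for our augmented belief-state MDP: Only
the $\alpha$-functions that yield optimal values for a sampled set
of reachable beliefs $B_s = \{b^1_s, b^2_s, \ldots, b^{|B_s|}_s\}$ are
computed (see the modifications in steps 3 and 4 of the PB-BACKUP
algorithm). The resulting algorithm is shown below:

PB-BACKUP($B_s = \{b^1_s, b^2_s, \ldots, b^{|B_s|}_s\}$, $s$, $k + 1$)

1. $\Gamma^{*}_{s,u} \leftarrow \left\{ g(\lambda) \triangleq \sum_v c_v\, \Psi_v(\lambda) \right\}$

2. $\Gamma^{v,s'}_{s,u} \leftarrow \left\{ g_j(\lambda) \triangleq \sum_{i=1}^{m} c^{vi}_{s'}\, \Psi^{vi}_{s'}(\lambda) \right\}_{j = 1, \ldots, |\Gamma^k_{s'}|}$

3. $\Gamma^{i}_{s,u} \leftarrow \left\{ g + \sum_{s', v} \arg\max_{g_j \in \Gamma^{v,s'}_{s,u}} \langle g_j, b^i_s \rangle \right\}_{g \in \Gamma^{*}_{s,u}}$

4. $\Gamma^{k+1}_s \leftarrow \left\{ g_i \triangleq \arg\max_{g \in \Gamma^{i}_{s,u},\, u \in U} \langle g, b^i_s \rangle \right\}_{i = 1, \ldots, |B_s|}$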

Secondly, to address the issue of parameter explosion, each
$\alpha$-function is projected onto a fixed number of basis functions
to keep the number of parameters from growing exponentially.
This projection is done after each PB-BACKUP
operation, hence always keeping the number of parameters
fixed (i.e., one parameter per basis function). In particular,
since each $\alpha$-function is in fact a linear combination of
functions in $\mathbf{\Phi}$ (Theorem 4), it is natural to choose these
basis functions from $\mathbf{\Phi}$.⁶ Besides, it is easy to see from
(3) that each sampled belief $b^i_s$ can also be written as

$$b^i_s(\lambda) = \eta\, \Phi^i_s(\lambda)\, b(\lambda) \qquad (12)$$

where $b$ is the initial prior belief,
$\eta \triangleq 1 / \langle \Phi^i_s, b \rangle$, and
$\Phi^i_s \in \mathbf{\Phi}$. For convenience, these
$\{\Phi^i_s\}_{i = 1, \ldots, |B_s|}$ are selected as basis
functions. Specifically, after each PB-BACKUP operation,
each $\alpha_s \in \Gamma^k_s$ is projected onto the function space
defined by $\{\Phi^i_s\}_{i = 1, \ldots, |B_s|}$. This projection is
then cast as an optimization problem that minimizes the squared
difference $J(\alpha_s)$ between the $\alpha$-function and its
projection with respect to the sampled beliefs in $B_s$:

$$J(\alpha_s) \triangleq \frac{1}{2} \sum_{j=1}^{|B_s|} \left( \langle \alpha_s, b^j_s \rangle - \sum_{i=1}^{|B_s|} c_i\, \langle \Phi^i_s, b^j_s \rangle \right)^2. \qquad (13)$$

This can be done analytically by letting
$\partial J(\alpha_s) / \partial c_i = 0$ and solving for $c_i$, which
is equivalent to solving a linear system $Ax = d$ where $x_i = c_i$,
$A_{ji} = \sum_{k=1}^{|B_s|} \langle \Phi^i_s, b^k_s \rangle \langle \Phi^j_s, b^k_s \rangle$
and $d_j = \sum_{k=1}^{|B_s|} \langle \Phi^j_s, b^k_s \rangle \langle \alpha_s, b^k_s \rangle$.
Note that this projection works directly with the values
$\langle \alpha_s, b^k_s \rangle$ instead of the exact parametric form
of $\alpha_s$ in (9). This allows for a more compact
implementation of the PB-BACKUP algorithm presented
above: Instead of maintaining the exact parameters that represent
each of the immediate functions $g$, only their evaluations
at the sampled beliefs $B_s = \{b^1_s, b^2_s, \ldots, b^{|B_s|}_s\}$
need to be maintained. In particular, the values of
$\{\langle g, b^i_s \rangle\}_{i = 1, \ldots, |B_s|}$ can be estimated
as follows:

$$\langle g, b^i_s \rangle = \eta \int_\lambda g(\lambda)\, \Phi^i_s(\lambda)\, b(\lambda)\, d\lambda \approx \frac{\sum_{j=1}^{n} g(\lambda^j)\, \Phi^i_s(\lambda^j)}{\sum_{j=1}^{n} \Phi^i_s(\lambda^j)} \qquad (14)$$

where $\{\lambda^j\}_{j=1}^{n}$ are samples drawn from the initial
prior $b$.

6 See Appendix B for other choices.
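The projection step and the sample-based inner products can be sketched together as below. This is a minimal pure-Python illustration under assumptions: the helper names `inner`, `solve`, and `project`, the two toy basis functions, and the grid standing in for prior samples are all hypothetical, and a small Gaussian elimination replaces whatever linear solver an actual implementation would use.

```python
def inner(g, phi_k, samples):
    """Estimate <g, b_k> for the belief induced by phi_k (self-normalised average)."""
    num = sum(g(lam) * phi_k(lam) for lam in samples)
    den = sum(phi_k(lam) for lam in samples)
    return num / den


def solve(A, d):
    """Gaussian elimination with partial pivoting for a small system A x = d."""
    n = len(d)
    M = [row[:] + [d[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x


def project(alpha, basis, samples):
    """Coefficients minimising the squared difference over the sampled beliefs."""
    m = len(basis)
    # F[i][k] = <Phi_i, b_k>, a[k] = <alpha, b_k>
    F = [[inner(basis[i], basis[k], samples) for k in range(m)] for i in range(m)]
    a = [inner(alpha, basis[k], samples) for k in range(m)]
    A = [[sum(F[i][k] * F[j][k] for k in range(m)) for i in range(m)]
         for j in range(m)]
    d = [sum(F[j][k] * a[k] for k in range(m)) for j in range(m)]
    return solve(A, d)


samples = [i / 1000 for i in range(1, 1000)]       # stand-in for prior samples
basis = [lambda lam: lam, lambda lam: 1.0 - lam]   # toy basis functions
alpha = lambda lam: 2.0 * lam + 3.0 * (1.0 - lam)  # lies in the span of the basis
coeffs = project(alpha, basis, samples)
```

Because the toy `alpha` already lies in the span of the basis, the recovered `coeffs` match the generating coefficients 2 and 3 up to floating-point error; for a function outside the span the same code would return the least-squares fit at the sampled beliefs.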
During the online execution phase, (14) is also used to compute
the expected payoff for the $\alpha$-functions evaluated at the
current belief $b'(\lambda) = \eta\, \Phi(\lambda)\, b(\lambda)$:

$$\langle \alpha_s, b' \rangle \approx \frac{\sum_{j=1}^{n} \Phi(\lambda^j) \sum_{i=1}^{|B_s|} c_i\, \Phi^i_s(\lambda^j)}{\sum_{j=1}^{n} \Phi(\lambda^j)}. \qquad (15)$$

So, the real-time processing cost of evaluating each
$\alpha$-function's expected reward at a particular belief is
$O(|B_s| n)$. Since the sampling of $\{b^i_s\}$, $\{\lambda^j\}$ and the
computation of $\{\sum_{i=1}^{|B_s|} c_i\, \Phi^i_s(\lambda^j)\}_j$ can
be performed in advance, this $O(|B_s| n)$ cost is further reduced to
$O(n)$, which makes the action selection incur $O(|B_s| n)$ cost in
total. This is significantly cheaper than the total cost
$O(nk|S|^2|U||V|)$ of online sampling and re-estimating $V_s$ incurred
by BPVI [Chalkiadakis and Boutilier, 2003]. Also, note that since
the offline computational costs in steps 1 to 4 of
PB-BACKUP($B_s$, $s$, $k + 1$) and the projection cost, which is cast
as the cost of solving a system of linear equations, are always
polynomial functions of the variables of interest (e.g.,
$|S|$, $|U|$, $|V|$, $n$, $|B_s|$), the optimal policy can be
approximated in polynomial time.
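The split between offline precomputation and the cheap online evaluation can be sketched as follows. The function names `precompute_alpha` and `expected_payoff`, the toy basis, and the grid of prior samples are assumptions for illustration only.

```python
def precompute_alpha(coeffs, basis, samples):
    """Offline: value of the projected alpha-function at every prior sample."""
    return [sum(c * phi(lam) for c, phi in zip(coeffs, basis)) for lam in samples]


def expected_payoff(alpha_at_samples, current_phi, samples):
    """Online: weight the cached values by the current likelihood, one pass over n."""
    weights = [current_phi(lam) for lam in samples]
    total = sum(a * w for a, w in zip(alpha_at_samples, weights))
    return total / sum(weights)


samples = [i / 1000 for i in range(1, 1000)]           # stand-in for prior samples
basis = [lambda lam: lam, lambda lam: 1.0 - lam]       # toy basis functions
cached = precompute_alpha([2.0, 3.0], basis, samples)  # alpha(lam) = 3 - lam
value = expected_payoff(cached, lambda lam: lam ** 3, samples)
```

Here the current likelihood is the cube of the parameter, whose posterior mean under the assumed uniform prior is about 0.8, so `value` is close to 2.2. Only the weights depend on the interaction history, which is what keeps the online cost linear in the number of samples per alpha-function.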

4 Experiments and Discussion 

In this section, a realistic scenario of intersection navigation is
modeled as a stochastic game; it is inspired by a near-miss
accident during the 2007 DARPA Urban Challenge. Considering
the traffic situation illustrated in Fig. 1 where two autonomous
vehicles (marked A and B) are about to enter an
intersection (I), the road segments are discretized into a uniform
grid with cell size 5 m × 5 m and the speed of each vehicle
is also discretized uniformly into 5 levels ranging from
0 m/s to 4 m/s. So, in each stage, the system's state can be
characterized as a tuple $\{P_A, P_B, S_A, S_B\}$ specifying the
current positions ($P$) and velocities ($S$) of A and B, respectively.
In addition, our vehicle (A) can either accelerate (+1 m/s²),
decelerate (−1 m/s²), or maintain its speed (+0 m/s²) in each
time step while the other vehicle (B) changes its speed based
on a parameterized reactive model [Gipps, 1981]:

$$v_{\text{safe}} = S_B + \frac{\text{Distance}(P_A, P_B) - \tau S_B}{S_B / d + \tau}$$

$$v_{\text{des}} = \min(4,\ S_B + a,\ v_{\text{safe}})$$

$$S'_B \sim \text{Uniform}(\max(0,\ v_{\text{des}} - \sigma a),\ v_{\text{des}}).$$

In this model, the driver's acceleration $a \in$ [0.5 m/s², 3 m/s²],
deceleration $d \in$ [−3 m/s², −0.5 m/s²], reaction time $\tau \in$
[0.5 s, 2 s], and imperfection $\sigma \in [0, 1]$ are the unknown
parameters distributed uniformly within the corresponding
ranges. This parameterization can cover a variety of drivers'
typical behaviors, as shown in a preliminary study. For a further
understanding of these parameters, the readers are referred
to [Gipps, 1981]. Besides, in each time step, each vehicle
$X \in \{A, B\}$ moves from its current cell $P_X$ to the next
cell $P'_X$ with probability $1/t$ and remains in the same cell with
probability $1 - 1/t$ where $t$ is the expected time to move forward
one cell from the current position with respect to the current
speed (e.g., $t = 5/S_X$). Thus, in general, the underlying
stochastic game has 6 × 6 × 5 × 5 = 900 states (i.e., each vehicle
has 6 possible positions and 5 levels of speed), which is
significantly larger than the settings in previous experiments.
In each state, our vehicle has 3 actions, as mentioned previously,
while the other vehicle has 5 actions corresponding to
5 levels of speed according to the reactive model.
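The cell-transition rule and the size of the state space described above can be sketched as follows. The constant and function names are hypothetical, and treating a stopped vehicle as staying in place is an assumption of this sketch, since the expected time to cross a cell is unbounded at zero speed.

```python
import random

CELL_SIZE_M = 5.0      # cell length in metres
NUM_POSITIONS = 6      # possible positions per vehicle
NUM_SPEED_LEVELS = 5   # speeds 0, 1, 2, 3, 4 m/s


def move_probability(speed):
    """Probability 1/t of advancing one cell, with t = CELL_SIZE_M / speed."""
    if speed <= 0:
        return 0.0  # assumption: a stopped vehicle stays in its cell
    return min(1.0, speed / CELL_SIZE_M)


def step_position(cell, speed, rng):
    """Sample the next cell index for one vehicle in one time step."""
    return cell + 1 if rng.random() < move_probability(speed) else cell


def num_states():
    """Joint states: positions and speed levels of both vehicles."""
    return (NUM_POSITIONS * NUM_SPEED_LEVELS) ** 2


rng = random.Random(0)
next_cell = step_position(2, 0, rng)  # a stopped vehicle does not move
```

At the top speed of 4 m/s the vehicle advances with probability 0.8 per step, and `num_states()` reproduces the 900 joint states quoted in the text.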



Figure 1: (Left) A near-miss accident during the 2007
DARPA Urban Challenge, and (Right) the discretized environment:
A and B move towards destinations $D_A$ and $D_B$
while avoiding collision at I. Shaded areas are not passable.

The goal for our vehicle in this domain is to learn the other
vehicle's reactive model and adjust its navigation strategy
accordingly such that there is no collision and the time spent to
cross the intersection is minimized. To achieve this goal, we
penalize our vehicle in each step by −1 and reward it with
50 when it successfully crosses the intersection. If it collides
with the other vehicle (at I), we penalize it by −250. The
discount factor is set as 0.99. We evaluate the performance of
I-BRL in this problem against 100 different sets of reactive
parameters (for the other vehicle) generated uniformly from
the above ranges. Against each set of parameters, we run 20
simulations ($h = 100$ steps each) to estimate our vehicle's
average performance.⁷ In particular, we compare our algorithm's
average performance against the average performance
of a fully informed vehicle (Upper Bound) that knows exactly
the reactive parameters before each simulation, a rational
vehicle (Exploit) that estimates the reactive parameters
by taking the means of the above ranges, and a vehicle employing
BPVI [Chalkiadakis and Boutilier, 2003] (BPVI).

The results are shown in Fig. 2a: It can be observed that
our vehicle always performs significantly better than both
the rational and BPVI-based vehicles. In particular, our vehicle
manages to reduce the performance gap between the
fully informed and rational vehicles roughly by half. The
difference in performance between our vehicle and the fully
informed vehicle is expected as the fully informed vehicle
always takes the optimal step from the beginning (since it
knows the reactive parameters in advance) while our vehicle
has to take cautious steps (by maintaining a slow speed)
before it feels confident with the information collected during
interaction. Intuitively, the performance gap is mainly caused
during this initial period of "caution". Also, since the uniform
prior over the reactive parameters
$\lambda = \{a, d, \tau, \sigma\}$ is not a conjugate prior for the
other vehicle's behavior model
$\theta_s(v) \triangleq p(v \mid s, \lambda)$, the BPVI-based vehicle
has to directly maintain and update its belief using FDM:
$\lambda = \{\theta_s\}_s$ with
$\theta_s = \{\theta^v_s\}_v \sim \text{Dir}(\{n^v_s\}_v)$ (Section 2),
instead of $\lambda = \{a, d, \tau, \sigma\}$. However, FDM implicitly
assumes that $\{\theta_s\}_s$ are statistically independent, which is
not true in this case since all $\theta_s$ are actually related by
$\{a, d, \tau, \sigma\}$. Unfortunately, BPVI cannot exploit this
information to generalize the other vehicle's behavior across
different states due to its restrictive FDM (i.e., independent
multinomial likelihoods with separate Dirichlet priors), thus
resulting in an inferior performance.

7 After our vehicle successfully crosses the intersection, the
system's state is reset to the default state in Fig. 1 (Right).

Figure 2: (a) Performance comparison between our vehicle
(I-BRL), the fully informed, the rational and the BPVI vehicles
($\phi = 0.99$); (b) Our approach's offline planning time.

5 Related Works 

In self-interested (or non-cooperative) MARL, there have been
several groups of proponents advocating different learning
goals, the following of which have garnered substantial support:
(a) stability: in self-play or against a certain class
of learning opponents, the learners' behaviors converge to
an equilibrium; (b) optimality: a learner's behavior necessarily
converges to the best policy against a certain class
of learning opponents; and (c) security: a learner's average
payoff must exceed the maximin value of the game. For
example, the works of Littman [2001], Bianchi et al. [2007],
and Akchurina [2009] have focused on (evolutionary)
game-theoretic approaches that satisfy the stability criterion in
self-play. The works of Bowling and Veloso [2001], Suematsu
and Hayashi [2002], and Tesauro [2003] have developed
algorithms that address both the optimality and stability criteria:
A learner essentially converges to the best response if the
opponents' policies are stationary; otherwise, it converges in
self-play. Notably, the work of Powers and Shoham [2005]
has proposed an approach that provably converges to an
$\epsilon$-best response (i.e., optimality) against a class of
adaptive, bounded-memory opponents while simultaneously
guaranteeing a minimum average payoff (i.e., security) in
single-state, repeated games.

In contrast to the above-mentioned works that focus on
convergence, I-BRL directly optimizes a learner's performance
during its course of interaction, which may terminate
before it can successfully learn its opponent's behavior. So,
our main concern is how well the learner can perform before
its behavior converges. From a practical perspective, this
seems to be a more appropriate goal: In reality, the agents
may only interact for a limited period, which is not enough
to guarantee convergence, thus undermining the stability and
optimality criteria. In such a context, the existing approaches
appear to be at a disadvantage: (a) Algorithms that focus
on stability and optimality tend to select exploratory actions
with drastic effect without considering their huge costs
(i.e., poor rewards) [Chalkiadakis and Boutilier, 2003]; and
(b) though the notion of security aims to prevent a learner
from selecting such radical actions, the proposed security values
(e.g., maximin value) may not always turn out to be tight
lower bounds for the optimal performance [Hoang and Low,
2013a]. Interested readers are referred to [Chalkiadakis and
Boutilier, 2003] and Appendix C for a detailed discussion and
additional experiments to compare performances of I-BRL
and these approaches, respectively.

Note that while solving for the Bayes-optimal policy effi- 
ciently has not been addressed explicitly in general prior to 
this paper, we can actually avoid this problem by allowing 
the agent to act sub-optimally in a bounded number of steps.
In particular, the works of Asmuth and Littman [2011] and
Araya-Lopez et al. [2012] guarantee that, in the worst case,
the agent will act approximately Bayes-optimally in all
but a polynomially bounded number of steps with high prob- 
ability. It is thus necessary to point out the difference be- 
tween I-BRL and these worst-case approaches: We are in- 
terested in maximizing the average-case performance with 
certainty rather than the worst-case performance with some 
"high probability" guarantee. Comparing their performances 
is beyond the scope of this paper. 



6 Conclusion 

This paper describes a novel generalization of BRL, called 
I-BRL, to integrate the general class of parametric mod- 
els and model priors of the opponent's behavior. As a re- 
sult, I-BRL relaxes the restrictive assumption of FDM that 
is often imposed in existing works, thus offering practition- 
ers greater flexibility in encoding their prior domain knowl- 
edge of the opponent's behavior. Empirical evaluation shows 
that I-BRL outperforms a Bayesian MARL approach utilizing 
FDM called BPVI. I-BRL also outperforms existing MARL 
approaches focusing on convergence (Section 5), as shown in
the additional experiments in [Hoang and Low, 2013a]. To
this end, we have successfully bridged the gap in applying 
BRL to self-interested multi-agent settings. 
A Proof Sketches for Theorems 2 and 3

This section provides more detailed proof sketches for
Theorems 2 and 3 as mentioned in Section 3.

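Theorem 2. The optimal value function $V^k$ for $k$ steps-to-go
converges to the optimal value function $V$ for infinite horizon
as $k \to \infty$: $\|V - V^{k+1}\|_\infty \le \phi \|V - V^k\|_\infty$.

Proof Sketch. Define $L_s^k(b) = |V_s(b) - V_s^k(b)|$. Using
$|\max_a f(a) - \max_a g(a)| \le \max_a |f(a) - g(a)|$,

$$
\begin{aligned}
L_s^{k+1}(b) &\le \phi \max_u \sum_{v,s'} \langle p_s^v, b \rangle\, p_s^{uv}(s')\, L_{s'}^k(b_s^v) \\
&\le \phi \max_u \sum_{v,s'} \langle p_s^v, b \rangle\, p_s^{uv}(s')\, \|V - V^k\|_\infty \\
&\le \phi\, \|V - V^k\|_\infty .
\end{aligned}
\tag{16}
$$

Since the last inequality (16) holds for every pair $(s, b)$, it
follows that $\|V - V^{k+1}\|_\infty \le \phi \|V - V^k\|_\infty$.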
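Applying the bound of Theorem 2 repeatedly gives $\|V - V^k\|_\infty \le \phi^k \|V - V^0\|_\infty$, so the gap shrinks geometrically with the number of back-ups. The sketch below is only an illustration of that consequence: the function name and the `initial_gap` argument (an upper bound on $\|V - V^0\|_\infty$ that the reader must supply) are assumptions and do not appear in the paper.

```python
def backups_needed(phi: float, initial_gap: float, eps: float) -> int:
    """Smallest k such that phi**k * initial_gap <= eps.

    phi         : discount factor, 0 < phi < 1
    initial_gap : assumed upper bound on ||V - V^0||_inf (hypothetical input)
    eps         : target tolerance, eps > 0
    """
    if not 0.0 < phi < 1.0:
        raise ValueError("discount factor must satisfy 0 < phi < 1")
    if eps <= 0.0:
        raise ValueError("tolerance must be positive")
    k, gap = 0, initial_gap
    # Each back-up contracts the sup-norm gap by at least a factor of phi.
    while gap > eps:
        gap *= phi
        k += 1
    return k
```

For instance, with $\phi = 0.5$ and an initial gap of 1, three back-ups suffice to bring the bound down to 0.125.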

Theorem 3. The optimal value function $V_s^k(b)$ for $k$ steps-to-go
can be represented as a finite set $\Gamma_s^k$ of $\alpha$-functions:

$$V_s^k(b) = \max_{\alpha_s \in \Gamma_s^k} \langle \alpha_s, b \rangle . \tag{17}$$

Proof Sketch. We give a constructive proof to (17) by induction,
which shows how $\Gamma_s^k$ can be built recursively. Assuming
that (17) holds for $k$ (when $k = 0$, (17) can be verified by letting
$\alpha_s(\lambda) = 0$), it can be proven that (17) also holds for
$k + 1$. In particular, it follows from our inductive assumption
that the term $V_{s'}^k(b_s^v)$ in (5) can be rewritten as:

$$
\begin{aligned}
V_{s'}^k(b_s^v) &= \max_{j=1}^{|\Gamma_{s'}^k|} \int_\lambda \alpha_{s'}^j(\lambda)\, b_s^v(\lambda)\, d\lambda \\
&= \max_{j=1}^{|\Gamma_{s'}^k|} \int_\lambda \alpha_{s'}^j(\lambda)\, \frac{p_s^v(\lambda)\, b(\lambda)}{\langle p_s^v, b \rangle}\, d\lambda \\
&= \langle p_s^v, b \rangle^{-1} \max_{j=1}^{|\Gamma_{s'}^k|} \int_\lambda b(\lambda)\, \alpha_{s'}^j(\lambda)\, p_s^v(\lambda)\, d\lambda .
\end{aligned}
$$

By plugging the above equation into (5) and using the fact
that $r_s^b(u) = \sum_v \langle p_s^v, b \rangle\, r_s(u, v)$,

$$V_s^{k+1}(b) = \max_u \Big( r_s^b(u) + \phi \sum_{s',v} \max_{j=1}^{|\Gamma_{s'}^k|} p_s^{uv}(s')\, Q_s^v(\alpha_{s'}^j, b) \Big) \tag{18}$$

where $Q_s^v(\alpha_{s'}^j, b) = \int_\lambda \alpha_{s'}^j(\lambda)\, p_s^v(\lambda)\, b(\lambda)\, d\lambda$. Now, applying
the fact that

$$\sum_{s'} \sum_v \max_{t_{s'v}} A_{s'v}[t_{s'v}] = \max_t \sum_{s'} \sum_v A_{s'v}[t_{s'v}]$$

where $A_{s'v}[t_{s'v}] = p_s^{uv}(s')\, Q_s^v(\alpha_{s'}^{t_{s'v}}, b)$ and using
$r_s^b(u) = \int_\lambda b(\lambda) \sum_v p_s^v(\lambda)\, r_s(u, v)\, d\lambda$, (18) can be rewritten as

$$V_s^{k+1}(b) = \max_{u,t} \int_\lambda b(\lambda)\, \alpha_s^{ut}(\lambda)\, d\lambda \tag{19}$$

with $t = (t_{s'v})_{s' \in S, v \in V}$ and

$$\alpha_s^{ut}(\lambda) = \sum_v p_s^v(\lambda) \Big( r_s(u, v) + \phi \sum_{s'} \alpha_{s'}^{t_{s'v}}(\lambda)\, p_s^{uv}(s') \Big) . \tag{20}$$

By setting $\Gamma_s^{k+1} = \{\alpha_s^{ut}\}_{u,t}$ and $u_{\alpha_s^{ut}} = u$, it can be verified
that (17) also holds for $k + 1$.

B Alternative Choice of Basis Functions

This section demonstrates another theoretical advantage of
our framework: The flexibility to customize the general
point-based algorithm presented in Section 3.2 into more
manageable forms (e.g., simple, easy to implement, etc.) with respect
to different choices of basis functions. Interestingly, these
customizations often allow the practitioners to trade off
effectively between the performance and sophistication of the
implemented algorithm: A simple choice of basis functions
may (though not necessarily) reduce its performance but, in
exchange, bestows upon it a customization that is more
computationally efficient and easier to implement. This is
especially useful in practical situations where finding a good
enough solution quickly is more important than looking for
better yet time-consuming solutions.

As an example, we present such an alternative of the basis
functions in the rest of this section. In particular, let $\{\lambda^i\}_{i=1}^n$
be a set of the opponent's models sampled from the initial
belief $b$. Also, let $\Psi_i(\lambda)$ denote a function that returns 1
if $\lambda = \lambda^i$, and 0 otherwise. According to Section 3.2, to
keep the number of parameters from growing exponentially,
we project each $\alpha$-function onto $\{\Psi_i(\lambda)\}_{i=1}^n$ by minimizing
(13) or alternatively, the unconstrained squared difference
between the $\alpha$-function and its projection:

$$J(\alpha_s) = \frac{1}{2} \int_\lambda \Big( \alpha_s(\lambda) - \sum_{i=1}^n c_i \Psi_i(\lambda) \Big)^2 d\lambda . \tag{21}$$

Now, let us consider (8), which specifies the exact solution
for (5) in Section 3.1. Assume that $\alpha_{s'}^{t_{s'v}}(\lambda)$ is projected onto
$\{\Psi_i(\lambda)\}_{i=1}^n$ by minimizing (21):

$$\bar{\alpha}_{s'}^{t_{s'v}}(\lambda) = \sum_{i=1}^n \Psi_i(\lambda)\, \varphi_{s'}^{t_{s'v}}(i) \tag{22}$$
where $\{\varphi_{s'}^{t_{s'v}}(i)\}_i$ are the projection coefficients. According
to the general point-based algorithm in Section 3.2, $\alpha_s^{ut}(\lambda)$ is
first computed by replacing $\alpha_{s'}^{t_{s'v}}(\lambda)$ with $\bar{\alpha}_{s'}^{t_{s'v}}(\lambda)$ in (8):

$$\alpha_s^{ut}(\lambda) = \sum_v p_s^v(\lambda) \Big( r_s(u, v) + \phi \sum_{s'} \bar{\alpha}_{s'}^{t_{s'v}}(\lambda)\, p_s^{uv}(s') \Big) . \tag{23}$$
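Then, following (21), $\alpha_s^{ut}(\lambda)$ is projected onto $\{\Psi_i(\lambda)\}_i$ by
solving for $\{\varphi_s^{ut}(i)\}_i$ that minimize

$$J(\alpha_s^{ut}) = \frac{1}{2} \int_\lambda \Big( \alpha_s^{ut}(\lambda) - \sum_{i=1}^n \Psi_i(\lambda)\, \varphi_s^{ut}(i) \Big)^2 d\lambda . \tag{24}$$

The back-up operation is therefore cast as finding $\{\varphi_s^{ut}(i)\}_i$
that minimize (24). To do this, define

$$L(\lambda) = \frac{1}{2} \Big( \alpha_s^{ut}(\lambda) - \sum_{i=1}^n \Psi_i(\lambda)\, \varphi_s^{ut}(i) \Big)^2$$

and take the corresponding partial derivatives of $L(\lambda)$ with
respect to $\{\varphi_s^{ut}(j)\}_j$:

$$\frac{\partial L(\lambda)}{\partial \varphi_s^{ut}(j)} = -\Big( \alpha_s^{ut}(\lambda) - \sum_{i=1}^n \Psi_i(\lambda)\, \varphi_s^{ut}(i) \Big) \Psi_j(\lambda) . \tag{25}$$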



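From the definition of $\Psi_j(\lambda)$, it is clear that when $\lambda \ne \lambda^j$,
$\frac{\partial L(\lambda)}{\partial \varphi_s^{ut}(j)} = 0$. Otherwise, this only happens when

$$\alpha_s^{ut}(\lambda^j) = \sum_{i=1}^n \Psi_i(\lambda^j)\, \varphi_s^{ut}(i) = \varphi_s^{ut}(j) \quad \text{(by def. of } \Psi_i(\lambda)\text{)} . \tag{26}$$

On the other hand, by plugging (22) into (23) and using
$\Psi_i(\lambda) = 0 \;\forall \lambda \ne \lambda^i$, $\alpha_s^{ut}(\lambda^j)$ can be expressed as

$$\alpha_s^{ut}(\lambda^j) = \sum_v p_s^v(\lambda^j) \Big( r_s(u, v) + \phi \sum_{s'} p_s^{uv}(s')\, \varphi_{s'}^{t_{s'v}}(j) \Big) .$$

So, to guarantee that $\frac{\partial L(\lambda)}{\partial \varphi_s^{ut}(j)} = 0 \;\forall \lambda, j$ (i.e., minimizing
(24) with respect to $\{\varphi_s^{ut}(j)\}_j$), the values of $\{\varphi_s^{ut}(j)\}_j$ can
be set as (from (26) and the above equation)

$$\varphi_s^{ut}(j) = \sum_v p_s^v(\lambda^j) \Big( r_s(u, v) + \phi \sum_{s'} p_s^{uv}(s')\, \varphi_{s'}^{t_{s'v}}(j) \Big) .$$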



Surprisingly, this equation specifies exactly the $\alpha$-vector
back-up operation for the discrete version of (5):

$$V_s^{k+1}(\hat{b}) = \max_u \sum_v \langle p_s^v, \hat{b} \rangle \Big( r_s(u, v) + \phi \sum_{s'} p_s^{uv}(s')\, V_{s'}^k(\hat{b}_s^v) \Big) \tag{27}$$

where $\hat{b}$ is the discrete distribution over the set of samples
$\{\lambda^j\}_j$ (i.e., $\sum_j \hat{b}(\lambda^j) = 1$). This implies that by choosing
$\{\Psi_i(\lambda)\}_i$ as our basis functions, finding the corresponding
"projected" solution for (5) is identical to solving (27), which
can be easily implemented using any of the existing discrete
POMDP solvers (e.g., [Pineau et al., 2003]).
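The discrete back-up described in this section can be sketched with NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the array layouts, the function names `backup_alpha_vector` and `value_at`, and the convention that `chosen[t, v]` holds the $\alpha$-vector selected for successor state `t` after opponent action `v` are all hypothetical choices made for the example.

```python
import numpy as np

def backup_alpha_vector(s, u, p_opp, reward, trans, chosen, phi):
    """Back up one alpha-vector over n sampled opponent models.

    p_opp[s, v, j]    : probability that sampled model j plays v in state s
    reward[s, u, v]   : immediate reward r_s(u, v)
    trans[s, u, v, t] : probability of reaching state t from s under (u, v)
    chosen[t, v, j]   : entry j of the alpha-vector picked for successor
                        state t after observing opponent action v
    phi               : discount factor
    """
    # Continuation value for each (v, j): sum over successor states t.
    future = np.einsum("vt,tvj->vj", trans[s, u], chosen)
    # Weight reward plus discounted continuation by each model's
    # probability of the opponent action v, then sum over v.
    return np.sum(p_opp[s] * (reward[s, u][:, None] + phi * future), axis=0)

def value_at(belief, alpha_vectors):
    """Value of a discrete belief: best inner product over alpha-vectors."""
    return max(float(np.dot(belief, alpha)) for alpha in alpha_vectors)
```

Each entry of the returned vector corresponds to one sampled model $\lambda^j$, so the continuous belief never has to be represented explicitly once the samples are fixed.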



C Additional Experiments 

This section provides additional evaluations of our proposed 
I-BRL framework, in comparison with existing works in 
MARL, through a series of stochastic games. In particular, I- 
BRL is evaluated in two small yet typical application domains 
that are widely used in the existing works (Sections C.1 and C.2).



C.1 Multi-Agent Chain World

In this problem, the system consists of a chain of 5 states
and 2 agents; each agent has 2 actions {a, b}. In each stage
of interaction, both agents will move one step forward or go
back to the initial state depending on whether they coordinate
on action a or b, respectively. In particular, the agents will
receive an immediate reward of 10 for coordinating on a in
the last state and 2 for coordinating on b in any state except
the first one. Otherwise, the agents will remain in the current
state and get no reward (Fig. 3). After each step, their payoffs
are discounted by a constant factor of $0 < \phi < 1$.




Figure 3: Multi-Agent Chain World Problem.
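The chain-world dynamics described above can be sketched in a few lines of Python. The function names are hypothetical, states are indexed 0 to 4, and one detail is an assumption because the text does not state it: coordinating on a in the last state is taken to keep the agents in the last state.

```python
N_STATES = 5  # states are indexed 0 (first) to 4 (last)

def step(state, action_1, action_2):
    """Return (next_state, shared_reward) for one stage of interaction."""
    last = N_STATES - 1
    if action_1 == action_2 == "a":
        # Coordinating on a moves one step forward; reward only in the last
        # state. Staying in the last state afterwards is an assumption.
        reward = 10.0 if state == last else 0.0
        return min(state + 1, last), reward
    if action_1 == action_2 == "b":
        # Coordinating on b returns to the first state; no reward if
        # the agents are already there.
        reward = 2.0 if state != 0 else 0.0
        return 0, reward
    # Miscoordination: stay in place and receive nothing.
    return state, 0.0

def discounted_return(rewards, phi):
    """Sum of rewards discounted by phi per step."""
    return sum((phi ** t) * r for t, r in enumerate(rewards))
```

The large reward sits at the end of the chain, so it can only be reached through several consecutive successful coordinations on a, each of which risks a wasted (and discounted) step if the opponent plays b.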

In this experiment, we compare the performance of I-BRL
with the state-of-the-art frameworks in MARL which
include BPVI [Chalkiadakis and Boutilier, 2003], Hyper-Q
[Tesauro, 2003], and Meta-Strategy [Powers and Shoham,
2005]. Among these works, BPVI is the most relevant to
I-BRL (Section 1). Hyper-Q simply extends Q-learning into the
context of multi-agent learning. Meta-Strategy, by default,
plays the best response to the empirical estimation of the
opponent's behavior and occasionally switches to the maximin
strategy when its accumulated reward falls below the
maximin value of the game. In particular, we compare the
average performance of these frameworks when tested against
100 different opponents whose behaviors are modeled as a set
of probabilities $\theta_s = \{\theta_s^v\}_v \sim \mathrm{Dir}(\{n_s^v\}_v)$ (i.e., of selecting
action $v$ in state $s$). These opponents are independently and
randomly generated from these Dirichlet distributions with
the parameters $n_s^v = 1/|V|$. Then, against each opponent, we
run 20 simulations ($h = 100$ steps each) to evaluate the
performance $R$ of each framework. The results show that I-BRL
significantly outperforms the others (Fig. 4a).
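As a rough illustration of how such a pool of opponents could be generated, the sketch below draws one action distribution per state from a Dirichlet distribution with all parameters equal to $1/|V|$. The function name, the seed, and the array layout are assumptions made for the example; only the Dirichlet parameters and the counts (100 opponents, 5 states, 2 actions) come from the text.

```python
import numpy as np

def sample_opponents(n_opponents, n_states, n_actions, seed=0):
    """Draw opponent behaviors theta[o, s, v] from Dir({1/|V|}_v).

    Returns an array of shape (n_opponents, n_states, n_actions) whose
    last axis is a probability distribution over the opponent's actions.
    """
    rng = np.random.default_rng(seed)
    concentration = np.full(n_actions, 1.0 / n_actions)
    # One independent Dirichlet draw per (opponent, state) pair.
    return rng.dirichlet(concentration, size=(n_opponents, n_states))

# Chain-world setting from the text: 100 opponents, 5 states, 2 actions.
opponents = sample_opponents(100, 5, 2)
```

With parameters below 1, most of the probability mass of each draw tends to fall on a single action, so many sampled opponents behave almost deterministically in a given state.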
Figure 4: (a) Performance comparison between I-BRL,
BPVI, Hyper-Q, and Meta-Strategy ($\phi = 0.75$), and (b)
I-BRL's offline planning time.

From these results, BPVI's inferior performance, as com- 
pared to I-BRL, is expected because BPVI, as mentioned in 
Section 1, relies on a sub-optimal myopic information-gain
function [Dearden et al., 1998], thus underestimating the risk
of moving forward and forfeiting the opportunity to go backward
to get more information and earn the small reward.
Hence, in many cases, the chance of getting big reward (be- 
fore it is severely discounted) is accidentally over-estimated 
due to BPVI's lack of information. As a result, this makes 
the expected gain of moving forward insufficient to compen- 
sate for the risk of doing so. Besides, it is also expected that 
Hyper-Q's and Meta-Strategy's performances are worse than
I-BRL's since they primarily focus on the criteria of optimality
and security, which put them at a disadvantage in the context
of our work, as explained previously in Section 1. Notably, in
this case, the maximin value of Meta-Strategy in the first state
is vacuously equal to 0, which is effectively a lower bound
for any algorithm. In contrast, I-BRL directly optimizes the
agent's expected utility by taking into account its current
belief and all possible sequences of future beliefs (see (5)). As
a result, our agent behaves cautiously and always takes the 
backward action until it has sufficient information to guaran- 
tee that the expected gain of moving forward is worth the risk 
of doing so. In addition, I-BRL's online processing cost is 
also significantly less expensive than BPVI's: I-BRL requires 
only 2.5 hours to complete 20 simulations (100 steps each) 
against 100 opponents while BPVI requires 4.2 hours. In ex- 
change for this speed-up, I-BRL spends a few hours of offline 
planning (Fig. 4b), which is a reasonable trade-off considering
how critical it is for an agent to meet the real-time
constraint during interaction.

C.2 Iterated Prisoner's Dilemma (IPD)

The IPD is an iterative version of the well-known one-shot,
two-player game known as the Prisoner's Dilemma, in which
each player attempts to maximize its reward by cooperating
(C) or betraying (B) the other. Unlike the one-shot game, the
game in IPD is played repeatedly and each player knows the
history of its opponent's moves, thus having the opportunity
to predict the opponent's behavior based on past interactions.
In each stage of interaction, the agents will get a reward of
3 or 1 depending on whether they mutually cooperate or
betray each other, respectively. In addition, the agent will get
no reward if it cooperates while its opponent betrays;
conversely, it gets a reward of 5 for betraying while the opponent
cooperates.
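The stage payoffs above can be written down as a small lookup table. The sketch below also computes the pure-strategy maximin (security) level of the stage game, which is a simplification: the maximin value discussed in Section 5 may in general involve mixed strategies. The names `AGENT_PAYOFF`, `stage_rewards`, and `pure_maximin` are hypothetical, and treating the game as symmetric for the opponent's reward is an assumption.

```python
ACTIONS = ("C", "B")  # cooperate, betray

# Agent's reward indexed by (agent action, opponent action), as in the text.
AGENT_PAYOFF = {
    ("C", "C"): 3,
    ("B", "B"): 1,
    ("C", "B"): 0,
    ("B", "C"): 5,
}

def stage_rewards(agent_action, opponent_action):
    """Return (agent reward, opponent reward), assuming a symmetric game."""
    return (AGENT_PAYOFF[(agent_action, opponent_action)],
            AGENT_PAYOFF[(opponent_action, agent_action)])

def pure_maximin():
    """Action with the best worst-case stage reward, and that reward."""
    worst = {a: min(AGENT_PAYOFF[(a, o)] for o in ACTIONS) for a in ACTIONS}
    best = max(worst, key=worst.get)
    return best, worst[best]
```

Betraying guarantees a stage reward of at least 1, whereas cooperating can yield 0, so the pure-strategy security action is B even though mutual cooperation pays more.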
Figure 5: (a) Performance comparison between I-BRL,
BPVI, Hyper-Q and Meta-Strategy ($\phi = 0.95$); (b) I-BRL's
offline planning time.

In this experiment, the opponent is assumed to make its
decision based on the last step of interaction (e.g., adaptive,
memory-bounded opponent). Thus, its behavior can be modeled
as a set of conditional probabilities
$\theta_s = \{\theta_s^v\}_v \sim \mathrm{Dir}(\{n_s^v\}_v)$ where $s = \{a_{-1}, o_{-1}\}$ encodes the agent's and
its opponent's actions $a_{-1}, o_{-1} \in \{B, C\}$ in the previous
step. Similar to the previous experiment (Section C.1), we
compare the average performance of I-BRL, BPVI
[Chalkiadakis and Boutilier, 2003], Hyper-Q [Tesauro, 2003], and
Meta-Strategy [Powers and Shoham, 2005] when tested
against 100 different opponents randomly generated from the
Dirichlet priors. The results are shown in Figs. 5a and 5b.
From these results, it can be clearly observed that I-BRL also
outperforms BPVI and the other methods in this experiment,
which is consistent with our explanations in the previous
experiment. Also, in terms of the online processing cost, I-BRL
only requires 1.74 hours to complete all the simulations while
BPVI requires 4 hours.
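To make the opponent model concrete, the sketch below samples a memory-one opponent and simulates a short interaction against a fixed agent policy. It is an illustration only, not the evaluation code used in the paper. Several details are assumptions: the Dirichlet parameters are set to $1/|V|$ as in Section C.1 (this section does not restate them), the interaction starts from an arbitrary previous step (C, C), performance is measured as the discounted sum of rewards, and all function names are hypothetical.

```python
import numpy as np

ACTIONS = ("B", "C")
# Agent's stage reward indexed by (agent action, opponent action).
PAYOFF = {("C", "C"): 3.0, ("B", "B"): 1.0, ("C", "B"): 0.0, ("B", "C"): 5.0}

def sample_memory_one_opponent(rng):
    """Map each previous step (a_prev, o_prev) to a distribution over ACTIONS."""
    concentration = np.full(len(ACTIONS), 1.0 / len(ACTIONS))  # assumed prior
    return {(a, o): rng.dirichlet(concentration)
            for a in ACTIONS for o in ACTIONS}

def simulate(theta, agent_policy, rng, horizon=100, phi=0.95,
             start=("C", "C")):
    """Discounted reward of agent_policy against the sampled opponent."""
    state, total = start, 0.0
    for t in range(horizon):
        agent_action = agent_policy(state)
        # The opponent conditions only on the previous joint action.
        index = rng.choice(len(ACTIONS), p=theta[state])
        opponent_action = ACTIONS[index]
        total += (phi ** t) * PAYOFF[(agent_action, opponent_action)]
        state = (agent_action, opponent_action)
    return total
```

For example, `simulate(theta, lambda s: "B", rng)` evaluates an agent that always betrays, which is a convenient baseline but not one of the methods compared in this section.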



